Efficient Pooling Designs for Library Screening

W. J. Bruno, E. Knill, D. J. Balding,
D. C. Bruce, N. A. Doggett, W. W. Sawhill,
R. L. Stallings, C. C. Whittaker, and D. C. Torney

February 8, 2008

Center for Human Genome Studies
Los Alamos National Laboratory
Los Alamos, NM 87545

T-10, Theoretical Biology and Biophysics, Mailstop K710
Theoretical Division

CIC-3, Computer Research and Applications, Mailstop K990
Computing, Information, and Communications Division

LS-2, Genomics and Structural Biology, Mailstop M880
Life Sciences Division

School of Mathematical Sciences
Queen Mary and Westfield College
University of London
London E1 4NS, UK

Department of Human Genetics
University of Pittsburgh
Pittsburgh, PA 15261, USA

Author for Correspondence
dct@ipmati.lanl.gov






Abstract 



We describe efficient methods for screening clone libraries, based
on pooling schemes which we call "random k-sets designs". In these
designs, the pools in which any clone occurs are equally likely to be
any possible selection of k from the v pools. The values of k and v
can be chosen to optimize desirable properties. Random k-sets designs
have substantial advantages over alternative pooling schemes: they are
efficient, flexible, easy to specify, require fewer pools, and have
error-correcting and error-detecting capabilities. In addition,
screening can often be achieved in only one pass, thus facilitating
automation. For design comparison, we assume a binomial distribution
for the number of "positive" clones, with parameters n, the number of
clones, and c, the coverage. We propose the expected number of
resolved positive clones (clones which are definitely positive based
upon the pool assays) as a criterion for the efficiency of a pooling
design. We determine the value of k which is optimal, with respect to
this criterion, as a function of v, n and c. We also describe superior
k-sets designs called k-sets packing designs. As an illustration, we
discuss a robotically implemented design for a 2.5-fold-coverage,
human chromosome 16 YAC library of n = 1,298 clones. We also estimate
the probability that each clone is positive, given the pool-assay data
and a model for experimental errors.

1 Introduction 



Much of the current effort of the Human Genome Project involves the
screening of large recombinant DNA libraries in order to isolate clones
containing a particular DNA sequence. This screening is important for
disease-gene mapping and also for large-scale clone mapping
[Olson, et al., 1989]. More generally, efficient screening techniques
can facilitate a broad range of basic and applied biological research.
Whenever the objective is to find "needles in a haystack", a reliable
test indicating whether or not at least one needle occurs in a specific
part of the haystack can greatly facilitate the isolation of the needles
[Du and Hwang, 1993], [Dyachov, 1979]. Such tests are called "binary
group tests". The most reliable binary group tests routinely used to
screen groups of clones for a particular DNA sequence employ either a
hybridization-based assay or a PCR-based STS assay.

Each group of clones is called a "pool", a group test is called a "pool
assay", and a collection of pools is called a "pooling design". A convenient
specification of a pooling design is a binary clone-by-pool incidence matrix:
if a clone occurs in a pool the matrix element equals unity. Figure 1a depicts
an incidence matrix for a small pooling design.

Any clone containing a DNA sequence of interest is termed a "positive".
Pools yielding a positive assay are termed "positive" pools. For the time
being we assume there are no experimental errors, in which case each
positive pool contains one or more positive clones. After the pools have been
assayed, the clones fall into one of four conceptual categories, illustrated in
Figure 1b. If a negative clone occurs in at least one pool containing only
negative clones then its negative status can be determined from the pool
assays and it is called a "resolved negative". The remaining negative clones
are called "unresolved negatives". A positive clone is called a "resolved
positive" if it occurs in at least one pool with no other positive clone and no
unresolved negative clone; otherwise it is called an "unresolved positive". Thus,
the status of the resolved positive and resolved negative clones is resolved
by the experiment. Although the experimenter will in principle be unable to
distinguish the unresolved positive from the unresolved negative clones, as
will be seen, it may be useful to separately analyze these two types of clones.
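
The four categories can be made concrete with a short sketch. The code
below is an illustration only, not one of the programs described in this
paper; the function name classify_clones, the representation of the design
as a list of pool sets, and the five-clone example are assumptions
introduced here. It assumes error-free pool assays, as in the text above.

```python
def classify_clones(design, positives):
    """Classify every clone, assuming error-free pool assays.

    design[i] is the set of pools containing clone i (hypothetical layout);
    positives is the set of indices of the truly positive clones.
    """
    n = len(design)
    # A pool is positive exactly when it contains at least one positive clone.
    positive_pools = set()
    for i in positives:
        positive_pools |= design[i]

    status = {}
    unresolved_negatives = set()
    for i in range(n):
        if i in positives:
            continue
        if design[i] - positive_pools:  # occurs in at least one negative pool
            status[i] = "resolved negative"
        else:
            status[i] = "unresolved negative"
            unresolved_negatives.add(i)

    for i in positives:
        resolved = False
        for pool in design[i]:
            others = {j for j in range(n) if j != i and pool in design[j]}
            # Resolved if some pool holds no other positive clone and
            # no unresolved negative clone.
            if not (others & positives) and not (others & unresolved_negatives):
                resolved = True
                break
        status[i] = "resolved positive" if resolved else "unresolved positive"
    return status


# Five clones on five pools; clones 0 and 4 are positive.
example_design = [{0, 1}, {1, 2}, {2, 3}, {0, 1}, {3, 4}]
example_status = classify_clones(example_design, positives={0, 4})
```

In this small example clone 3 shares both of its pools with positive clone
0, so neither of the two can be resolved, while clone 4 is resolved through
pool 4, which it shares with no other clone.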

The expected number of unresolved negative clones was previously
proposed as a criterion for selecting pooling designs [Barillot, et al., 1991].
However, the library-screening objective can range from the isolation of a
small proportion of the positive clones to the isolation of all of the positive
clones; hence, we propose the expected number of resolved positive clones as a
criterion of design optimality. For example, a cDNA library typically contains
multiple copies of many sequences, but the experimenter may want to identify
only one of these clones. In Section 3 we calculate both these expectations
for several proposed and implemented pooling designs, under the assumption
of a binomially-distributed number of positive clones. Confirmatory
tests are ordinarily performed on the candidate positive clones. The
average number of such tests that would be required to determine the status
of all the clones equals the expected number of positive clones plus the
expected number of unresolved negative clones. The probability that a design
yields a one-pass solution (i.e. the status of all clones is resolved) may also
provide a useful criterion for design optimality. This criterion, introduced
in [Balding and Torney, in press], will be discussed in a future publication
[Balding, et al., submitted].



Our focus is on one-stage designs in which all of the pools are assayed
in one pass. The use of one-stage designs facilitates automation because a
robot can be fully programmed from the outset, irrespective of intermediate
results. In addition, one-stage designs require the construction of fewer pools
over multiple screenings, since the same pools are used for each screening.
Standard pooling designs for large libraries have several stages, and they
often involve row-and-column pools, viz. Figure 2 and [Amemiya, et al., 1992],
[Chumakov, et al., 1992], [Evans and Lewis, 1989], [Green and Olson, 1990],
[Sloan, et al., 199^]. The examples of Section 3.2 illustrate the differences
between these approaches for a hypothetical tenfold-coverage, human-genome
library with 72,000 clones.

We propose "random k-sets" designs for one-stage library screening. In
these designs, each clone occurs in k pools, and all choices of the k pools are
equally likely [Dyachov, 1989]. Random k-sets designs are easy to specify for
any number of pools, and they are efficient, in terms of the expected
number of resolved positives, in comparison to alternative designs. The expected
numbers of resolved negative and resolved positive clones are given in Section
2.4. Using these formulas, the optimal choices of k for a given application
can be determined. Numerical results are given for a range of clone library
sizes and coverages in Section 3.1. (Although performance will vary
according to the specific instance of the random design, measures of performance
are narrowly distributed between different instances for large designs; thus,
expected measures of performance provide a useful guide.) We also discuss
techniques for constructing superior k-sets designs, called "k-sets packing"
designs, in Section 2.3.



In Section 2.5 we describe techniques for ranking candidate positive clones,
given the pooling design, the pool-assay data, and a model for experimental
errors. Ranking methods are employed to illustrate the performance of
random k-sets designs in the presence of pool-assay errors, through computer
simulation.

We have previously described efficient techniques for manual
construction of pools [McCormick, et al., 1993b]. The designs proposed here are
intended for implementation using robots. We employed the Packard 204
multiPROBE robot to pool a 1,298-clone, 2.5-fold-coverage, chromosome-specific
YAC library, implementing a four-sets design on 47 pools.

1.1 Notation 

The number of clones in the library is denoted by n. The number of pools in
a pooling design is denoted by v, and the number of pools in which a clone
occurs is denoted by k. The number of clones in a clone library which cover
any particular location of the cloned region is assumed to be
binomially-distributed with expectation c (called the coverage). The standard
notation Pr(A|Z) is used for the conditional probability of event A, given
event Z. We use the notation $\binom{a}{b}$ for the binomial coefficient
a!/(b!(a-b)!), and we introduce B for the binomial probabilities

\[ B(a,b,t) = \binom{a}{b}\, t^{b} (1-t)^{a-b}. \qquad (1) \]



2 Methods 



We developed several computer programs as part of our pooling methodology, 
described below. Copies of these programs are freely available from the 
author for correspondence. 

2.1 Pooling and Screening Methods for a 1,298-Clone 
YAC Library 

A complete-digest, Cla I library of 1,298 clones was generated from human
chromosome 16 [McCormick, et al., 1993a]. Based on an average cloned-DNA
size of 200 kilobasepairs and a chromosome size of 98 megabasepairs
[Morton, 1991], the coverage of this library was conservatively estimated at
c = 2.5. A four-sets packing design on v = 47 pools was used; see Section
2.3 for its characterization. Thus, each clone occurred in four of the pools.
(With this value of v, two sets of pools and controls can be put on one 96-well
dish.) Deep-96-well microtiter dishes were filled with YPD medium using the
Dynatech plate-filler; each well contained 1.9 ml of medium. These dishes
were incubated at 30 °C for 48 hours to check for contamination prior to
inoculation. Then, 14 uncontaminated dishes were inoculated by hand stamping
from a selective-medium copy of the library, lacking uracil and tryptophan.
The inoculated dishes were incubated at 30 °C for 72 hours, achieving
stationary growth. We subsequently used a Packard multiPROBE 204 robot
to construct the pools. The robot cycled through the following procedure,
using all four liquid-handling tips: suspend the yeast in a well, aspirate 0.8
ml, dispense 0.4 ml in each of two specified pool tubes, and sterilize and clean
the tips. (A larger syringe would have allowed us to aspirate the full well
volume and would have reduced the overall time for pooling.) The sterilizing
and tip cleaning began with a five-second rinsing with five ml of distilled
water. Next, one ml of 0.525% sodium hypochlorite solution was aspirated
and discharged. Then, the tips were rinsed with five ml of sterile water, and
the tips were placed in individual cups to clean the outside as well. (These
rinsing and sterilizing steps account for about half of the run time. They
were sufficient to prevent cross-contamination on the basis of a yeast-growth
assay.) Final pool volumes of approximately 46 ml were collected in 50 ml
centrifuge tubes. Pools were maintained at room temperature during the
pooling run, which took approximately 12 hours. Human intervention was
required hourly to replace the dishes from the arrayed library.

Agarose gel plugs containing yeast cells were prepared from each pool.
The plugs were treated with zymolyase to make spheroplasts, and they were
treated with ESP (500 mM EDTA, 1% sarcosyl, and 1 mg/ml proteinase K)
to digest away proteins. The plugs were dialyzed extensively against TE.
The DNA was purified from the agarose using the Geneclean kit. GeneAmp
PCR reagent kits were used for PCR screening of the pool DNA. (50 ng of
pool DNA, 2.5 units of Taq polymerase, and 1.5 mM MgCl2 were the
noteworthy features of the 50 µl reaction volume.)

2.2 Methods for Programming Robots 

Two software systems, written in 'C', were developed to facilitate use of
the Packard multiPROBE 204 robot. These systems were both used in the
implementation described in Section 2.1.

Our "scheduling" software can readily be adapted for an arbitrary robot.
This system enables the implementation of arbitrary pooling designs, viz.
Figure 1a, creating a list of volumes for well-to-pool transfers. In addition,
it lists the location, size, and quantity of "source" and "destination"
plasticware. The scheduling system can be used to coordinate the simultaneous
use of multiple robots engaged in constructing a set of pools from one library.
The output of the scheduling system is used by a robot instruction
"interpreter", described next.

The interpreter program reads the scheduling input file and generates a
complete list of commands for the Packard robot. The basic actions of the
robot are aspirating, dispensing, and rinsing. Each of these actions requires
many steps and hence many robot commands. When the scheduling program
provides the necessary parameters, the interpreter program generates these
commands, which can then be implemented using HAM, the Packard-robot
software controlling the monitor and the robot.

For example, the scheduling program supplies the well number and the
volume of aspiration, and the interpreter program generates the appropriate
list of robot commands. The interpreter determines whether multiple tips
can aspirate or dispense simultaneously. It also records all commands and
actions and enables termination of the run at any time.

2.3 k-Sets Design Generation: k-Sets Packings

A four-sets design on 47 pools was constructed using a pseudo-random
number generator. The design was based on a random four-sets design; however,
a prospective four-set was discarded if it had more than two elements in
common with another four-set already in the design. In addition, we ensured
that each pool contained approximately the same number of clones (109 to
111). In Section 3.3 our performance criterion is used to compare the design
we generated using these constraints to the average performance of a random
four-sets design.

Designs with a bound on the size of the intersection between two k-sets
are referred to as "k-sets packing" designs. In general, it is unsatisfactory to
have two identical k-sets in a random k-sets design: if one of them is positive
it would be impossible to establish the status of the other. Similarly, it is
undesirable for two clones to coincide in a large number of pools (i.e. for the
sets of pools each occurs in to be similar). Bounding the intersection ensures
that a negative clone cannot be unresolved due to a small number of positive
clones.
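
The rejection step described above can be sketched as follows. This is a
minimal illustration under stated assumptions, not the program used to
build the 47-pool design: the function names, the seed and max_tries
parameters, and the use of Python's random module are choices made here,
and the additional balancing of pool sizes (109 to 111 clones per pool)
is deliberately omitted.

```python
import random


def random_k_sets_design(n, v, k, seed=0):
    """Random k-sets design: each clone gets k of the v pools, uniformly."""
    rng = random.Random(seed)
    return [frozenset(rng.sample(range(v), k)) for _ in range(n)]


def k_sets_packing_design(n, v, k, max_overlap, seed=0, max_tries=1_000_000):
    """Random k-sets design with a bound on pairwise intersections.

    A candidate k-set is discarded when it shares more than max_overlap
    pools with a k-set already accepted (simple rejection sampling).
    """
    rng = random.Random(seed)
    design = []
    tries = 0
    while len(design) < n:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("could not complete the packing design")
        candidate = frozenset(rng.sample(range(v), k))
        if all(len(candidate & other) <= max_overlap for other in design):
            design.append(candidate)
    return design


# A reduced example: 200 clones, four pools per clone, 47 pools,
# at most two pools in common between any two clones.
packing = k_sets_packing_design(n=200, v=47, k=4, max_overlap=2, seed=1)
```

With max_overlap = k - 1 the constraint only forbids identical k-sets, and
with max_overlap = k the function reduces to the plain random k-sets design.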

2.4 Mathematical Methods 

In this section we give expressions for the measures of design performance
(expected numbers of resolved negative and resolved positive clones) for
random k-sets designs. We assume that each clone is positive with probability
c/n, independently of the other clones, which yields a binomial distribution
for the number of positive clones. Although this may not be strictly realistic,
it provides a reasonable approximation for the purposes of design comparison
[Clarke and Carbon, 1976], [Lander and Waterman, 1988]. The expressions
we derive are readily modified to accommodate an arbitrary probability that
there are p positive clones, provided that all p-subsets of the clones are equally
likely to be the positive clones. Such a modification would be useful when
screening cDNA libraries, for example. Here we assume no experimental
error. Pool-assay errors are treated in Section 2.5.

Recall the definition of random k-sets designs: each clone occurs in
precisely k of the v pools, and each of the possible $\binom{v}{k}$ subsets of
k pools has equal probability to be the pools in which a particular clone will
occur.

We begin by formulating N, the expected number of unresolved negative
clones (negative clones occurring only in positive pools); thus, n - c - N is
the expected number of resolved negative clones.

\[ N = \sum_{p=0}^{n} (n-p)\, B(n,p,c/n) \sum_{i=0}^{k} \binom{k}{i} (-1)^{i} z_i^{p}, \qquad (2) \]

where

\[ z_i = \binom{v-i}{k} \binom{v}{k}^{-1} \]

and B denotes the binomial probabilities (1). Interchanging the order of
summation in (2) and discarding some terms of order 1/n yields

\[ N \approx (n-c) \sum_{i=0}^{k} \binom{k}{i} (-1)^{i} e^{-c(1-z_i)}. \qquad (3) \]

Insight into the behavior of N can be obtained from the "independent pools"
approximation. The inner summation is an inclusion-exclusion formula for
the probability, $K_p^{(k)}$, that all of the k pools in which a negative clone
occurs contain at least one positive clone, given that there are p positive
clones. The probability that a given pool is negative is $(1-k/v)^{p}$, and the
independent pools approximation uses

\[ K_p^{(k)} \approx \left( 1 - \left( 1 - \frac{k}{v} \right)^{p} \right)^{k}. \qquad (4) \]

The approximation would be exact if pool outcomes were independent.
Because pool outcomes are negatively correlated, the independent pools
approximation gives an upper bound for $K_p^{(k)}$ and hence N. The bound is
tight for p not too small and improves as v increases. A recursive expression
for $K_p^{(k)}$ is given in Appendix ^ (Equation (||)).
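
As a numerical illustration of Equations (2), (3) and (4), the following
sketch evaluates the expected number of unresolved negative clones in
ordinary double precision. It is not the MAPLE or FORTRAN code mentioned
below; the function names are introduced here, z_i is taken to be the
probability that a random k-set avoids i specified pools, and the binomial
probabilities are generated by a simple recurrence so that no very large
integers are needed. For large k and v the alternating sum can lose
precision, which this sketch does not address.

```python
from math import comb, exp


def z(i, v, k):
    """Probability that one random k-set avoids i specified pools."""
    return comb(v - i, k) / comb(v, k)


def binomial_probabilities(n, q):
    """Yield B(n, p, q) for p = 0, ..., n using a term-to-term recurrence."""
    prob = (1.0 - q) ** n
    for p in range(n + 1):
        yield prob
        prob *= (n - p) / (p + 1) * q / (1.0 - q)


def unresolved_negatives(n, c, v, k, independent_pools=False):
    """Equation (2), or its independent-pools version based on (4)."""
    total = 0.0
    for p, prob in enumerate(binomial_probabilities(n, c / n)):
        if independent_pools:
            all_pools_positive = (1.0 - (1.0 - k / v) ** p) ** k
        else:
            # Inclusion-exclusion over the k pools of a negative clone.
            all_pools_positive = sum(
                comb(k, i) * (-1) ** i * z(i, v, k) ** p for i in range(k + 1)
            )
        total += (n - p) * prob * all_pools_positive
    return total


def unresolved_negatives_large_n(n, c, v, k):
    """Equation (3): the approximation obtained after summing over p."""
    return (n - c) * sum(
        comb(k, i) * (-1) ** i * exp(-c * (1.0 - z(i, v, k)))
        for i in range(k + 1)
    )


# Library of Section 2.1: n = 1,298 clones, c = 2.5, v = 47 pools, k = 4.
n_eq2 = unresolved_negatives(1298, 2.5, 47, 4)
n_eq3 = unresolved_negatives_large_n(1298, 2.5, 47, 4)
n_eq4 = unresolved_negatives(1298, 2.5, 47, 4, independent_pools=True)
```

For these parameters Equations (2) and (3) both give roughly 4.7 unresolved
negative clones, in line with the value quoted for random four-sets designs
in Section 3.3, and the independent-pools value is larger, as expected of an
upper bound.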

We now formulate P, the expected number of unresolved positive clones;
thus, c - P is the expected number of resolved positive clones. The details
of the calculation are given in Appendix 0. Recall that, in the absence of
pool-assay errors, a positive clone is unresolved if it occurs only in pools
containing either another positive clone or an unresolved negative clone.

After substitution and performing the summations over p and u, Equation 
fllTf) becomes 



n-l \/ 



in which 

and a^^\ (3 and are defined at, respectively, (|^), (|T3p and (0), and z_i is
defined below Equation (2).

It is seen that P can be evaluated with 0{v'^) summands. To evaluate (|^)
and (§) for large v, we used MAPLE V's extended-floating-point precision
capabilities [Char, et al., 1992]. In addition, we wrote our own
extended-precision FORTRAN routines, which typically required about one
percent as much c.p.u. time to obtain an equally accurate result. The
evaluation can also be done without using extended precision: the alternating
signs can be eliminated by using the recursive Equations (H) and (pISl). In
this case the summations over p and u must be approximated numerically.

We have also derived approximations for P that are more easily evaluated.
If the number of positive pools were fixed at the expected number, u:



then the probability that a negative clone is unresolved negative is $\mu$,
given by

\[ \mu = \binom{u}{k} \binom{v}{k}^{-1}. \]

Averaging over the number of unresolved negatives, with a Poisson ap- 
proximation to the binomial distribution, and using the "independent pools" 
approximation gives 



p=i 



V

A more accurate approximate formula is obtained if pools are not assumed
to be independent:

p=l j=o \^/ 

in which 

^-(T)(:)"" 

Equation (|) was used to generate Figures 3 and 4; standard "double preci- 
sion" floating-point operations were sufficient. We found, over the domain of 
Figure 3a, the largest difference from the exact minimum number of pools, 
from Equation (||), was about six percent of the number of pools. Also, the 
approximate result typically differs from the exact result by no more than 
one pool. 

2.5 Methods for Ranking Clones 

Although we focus on ranking the clones according to the probability that
each is positive, given the pool-assay data and a model for the experimental
errors, ranking sets of candidate positive clones can also be a desirable
objective. In the absence of errors, it is usually possible to identify many
resolved positive clones, but it can also be important to rank the remaining
clones according to the probability that they are positive. These objectives
can be achieved in the following framework.

Bayes' rule can be used to estimate the probability that an individual
clone is positive, given the vector of pool-assay outcomes, denoted V, and a
model for the experimental errors. In this context, Bayes' rule can be written
as follows:



in which $I_i^{+}$ (respectively $I_i^{-}$) denotes the event that clone i is
positive (respectively negative). To evaluate exactly the ratio on the RHS of
(0) would require a calculation for each possible subset of the clones, taken
to be P, the set of positive clones. Because $2^{n}$ is typically very large,
this is not feasible, and, therefore, we sampled subsets to estimate the ratio.
These estimates were then used to rank the clones. In our preliminary studies
of this approach, we selected subsets in which each clone has an equal
probability of appearing. For the example described in Section 3.1 with 33,000
clones, we sampled approximately 330,000 subsets for each clone, both for
the numerator and denominator of the ratio. To reduce both the noise of
and the computational work required for the sampling, individual clones
were added to and removed from these two groups of subsets, one used for
the numerator and the other for the denominator. However, a much more
efficient approach is Gibbs sampling of P; this and the Hastings-Metropolis
algorithm have also been implemented, facilitating the estimation of Equation
(13) [Bernardo and Smith, 1994], [Bruno, et al., 1994].

The model for experimental errors enters into the evaluation of the RHS
of Equation (0). For each set P of selected clones, taken to be the positive
clones, we find the union of the pools in which any of these clones occur. Call
this set T. In the pool-assay data, V, let $n_{-|+}$ be the number of negative
pools in T, and let $n_{+|-}$ be the number of positive pools not in T.
Furthermore, a two-parameter error model is adopted, with the error-rate
parameters for false-positive and false-negative pool assays equal to
$\lambda_{+|-}$ and $\lambda_{-|+}$, respectively. These errors are taken to be
independent; thus, the probabilities on the right-hand side of Equation (0) are
evaluated using

\[ \Pr(V|P) = \lambda_{+|-}^{\,n_{+|-}} (1-\lambda_{+|-})^{\,v-|T|-n_{+|-}}\; \lambda_{-|+}^{\,n_{-|+}} (1-\lambda_{-|+})^{\,|T|-n_{-|+}}, \]

in which |T| denotes the cardinality of T.
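
To illustrate how this error model can be combined with Bayes' rule, the
sketch below computes Pr(V|P) for a given candidate set of positive clones
and then, for a toy library small enough to enumerate all subsets, the
posterior probability that each clone is positive. It is an illustration
only: the function names, the three-clone design, and the exhaustive
enumeration are assumptions made here (the text describes sampling, which
is what a realistic library requires), and the prior takes each clone to be
positive independently with probability q = c/n, as in Section 2.4.

```python
from itertools import combinations


def likelihood(design, positive_set, observed_positive_pools, v, fp, fn):
    """Pr(V | P) under independent false-positive (fp) and false-negative
    (fn) pool-assay error rates. design[i] is the set of pools of clone i."""
    # T: union of the pools containing any clone of the candidate set.
    pools_t = set().union(*(design[i] for i in positive_set))
    n_false_neg = len(pools_t - observed_positive_pools)  # negative pools in T
    n_false_pos = len(observed_positive_pools - pools_t)  # positive pools not in T
    return (
        fp ** n_false_pos
        * (1.0 - fp) ** (v - len(pools_t) - n_false_pos)
        * fn ** n_false_neg
        * (1.0 - fn) ** (len(pools_t) - n_false_neg)
    )


def posterior_by_enumeration(design, observed_positive_pools, v, q, fp, fn):
    """Posterior probability that each clone is positive (tiny libraries only)."""
    n = len(design)
    weight_with_clone = [0.0] * n
    total_weight = 0.0
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            prior = q ** size * (1.0 - q) ** (n - size)
            weight = prior * likelihood(
                design, subset, observed_positive_pools, v, fp, fn
            )
            total_weight += weight
            for i in subset:
                weight_with_clone[i] += weight
    return [w / total_weight for w in weight_with_clone]


# Three clones on four pools; pools 0 and 1 were observed to be positive.
toy_design = [{0, 1}, {1, 2}, {2, 3}]
toy_posterior = posterior_by_enumeration(
    toy_design, observed_positive_pools={0, 1}, v=4, q=0.3, fp=0.01, fn=0.1
)
```

Here clone 0 explains both positive pools without invoking any assay error
and therefore receives by far the largest posterior probability; sorting
the clones by these values gives the ranking.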

3 Results 

3.1 Random k-sets

In this Section we employ our criterion, the expected number of resolved
positive clones, to select optimum parameters for random k-sets designs.
For libraries of varying number of clones, n, and coverage, c, the minimum
number of pools required for random k-sets designs to achieve an expected
number of resolved positive clones equal to 0.5c and 0.95c is illustrated in
Figures 3a and 4a, respectively. Figures 3b and 4b depict the corresponding
optimum values of k, which maximize the expected number of resolved positive
clones for the number of pools plotted in Figures 3a and 4a. The latter
plots depict a number of values of c at which the optimum value of k changes
abruptly. At these values of c, the product of c with the probability that there
are j or fewer positives is comparable to the desired expected number of
resolved positive clones. As c increases through each of these "transitions", the
optimal pooling design resolves the cases with one additional positive clone,
in order to achieve the desired expected number of resolved positive clones.
These transitions are most pronounced for small values of c; for example,
where the optimal designs go from resolving only one positive to resolving
two positives, over a small range of c centered on 0.69. These transitions
influence the minimum number of pools, contributing to the irregularity of
the contours in Figure 3a and Figure 4a in this region. For most values of
the variables c and n, the dependence of the expected number of resolved
positive clones upon k is not pronounced in the vicinity of the optimum value
of k. Thus, the optimum k provides a rough guideline for efficient pooling
designs, as can be seen in the following examples.

Given a goal of five resolved positive clones, on the average, a
tenfold-coverage, human-genome library of 33,000 clones [Cohen, et al., 199^]
could be accommodated on 170 pools (Figure 3a). The optimum value of k is ten,
which would result in an average of 1,941 clones per pool. The expected
number of unresolved negative clones is approximately 44. If, instead, the
goal were to have an average of 9.5 resolved positive clones, then 253
pools would be required (Figure 4a). The optimum value of k would also
be ten and there would be an average of approximately 1,304 clones per
pool. The expected number of unresolved negative clones is approximately
2.8. Therefore, on the average, 3.3 confirmatory tests (for 0.5 unresolved
positive and 2.8 unresolved negative clones) would be required to
resolve the status of every clone.

It may also be desirable to bound the average number of clones in a
pool in order to avoid high pool-assay error rates. This can be achieved
by constraining k, the number of pools containing any one clone. Suppose
it were desirable to have fewer clones per pool, say, approximately 1,000,
and also to achieve an average of five resolved positive clones for the library
described in the previous paragraph. Then, from Equation (^, 191 pools
would be required with k = 6, and the average number of clones in a pool
would be approximately 1,037.
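
The pool sizes quoted in these examples follow from the fact that n clones,
each placed in k of v pools, give nk/v clones per pool on average. A short
sketch of this arithmetic (the function names are introduced here for
illustration):

```python
def clones_per_pool(n, k, v):
    """Average pool size when each of n clones occurs in k of v pools."""
    return n * k / v


def unresolved_candidates(c, resolved_positives, unresolved_negatives):
    """Expected number of clones whose status is left unresolved:
    unresolved positives (c minus resolved positives) plus unresolved negatives."""
    return (c - resolved_positives) + unresolved_negatives


size_170 = clones_per_pool(33_000, 10, 170)   # five resolved positives on average
size_253 = clones_per_pool(33_000, 10, 253)   # 9.5 resolved positives on average
size_191 = clones_per_pool(33_000, 6, 191)    # k constrained to six
tests_253 = unresolved_candidates(10, 9.5, 2.8)
```

These evaluate to about 1,941, 1,304 and 1,037 clones per pool, and to 3.3
unresolved clones for the 253-pool design.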

Experimental error necessitates more pools to achieve comparable results.
We performed some computer-simulation experiments for a tenfold-coverage
library of 33,000 clones with a false-negative error rate of 0.1 and a
false-positive error rate of 0.01, rates consistent with our preliminary
experiments on pools containing approximately 110 clones, described in Section
3.3. We used a random ten-sets design, and an arbitrary set of ten clones was
selected to be the positive clones. The result of the Bayes' ranking, described
in Section 2.5, was that five of these ten clones were ranked in the top ten.
Thus, it is feasible to identify the positive clones, even with appreciable
experimental error.

3.2 Comparison with other designs 

To facilitate comparison, we consider a tenfold-coverage, human-genome
library of 72,000 clones for which the following row-and-column pooling design
has been implemented [Chumakov, et al., 1992]. The library is partitioned
into 94 lots, all but one containing eight 96-well dishes and the remaining
one with six. The rows and columns of eight microtiter dishes are combined
to construct 20 pools. In addition, eight more pools, each containing all of
the clones from one dish, are constructed. Thus, the total number of pools is
28 x 94 = 2,632. For this design, a lot contains a resolved positive clone only if
it contains only one positive clone. Therefore the expected number of resolved
positive clones is approximately $10 e^{-10/94} = 9.0$, and the expected number
of unresolved negative clones is approximately 3.3 [Barillot, et al., 1991]. The
following designs have been proposed for screening the same library, using
approximately one-tenth as many pools.
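
The figures quoted for the row-and-column design can be checked with a few
lines of arithmetic. In this sketch the function names are introduced for
illustration, and the exponential factor is read as the probability that
none of the other positive clones falls in the same lot, under the
assumption that their number per lot is approximately Poisson with mean
c/94; this reading is an interpretation made here, not a derivation given
in the text.

```python
from math import exp


def library_size(lots, dishes_per_lot, dishes_in_last_lot, wells_per_dish=96):
    """Clones in a library split into lots of dishes, the last lot smaller."""
    dishes = (lots - 1) * dishes_per_lot + dishes_in_last_lot
    return dishes * wells_per_dish


def row_column_pools(lots, rows=8, columns=12, dish_pools=8):
    """Pools per lot (row pools + column pools + whole-dish pools) times lots."""
    return lots * (rows + columns + dish_pools)


def expected_resolved_positives(c, lots):
    """c positives, each resolved when it is the only positive in its lot
    (Poisson approximation with mean c / lots for the other positives)."""
    return c * exp(-c / lots)


clones = library_size(lots=94, dishes_per_lot=8, dishes_in_last_lot=6)
pools = row_column_pools(lots=94)
resolved = expected_resolved_positives(c=10, lots=94)
```

The three values are 72,000 clones, 2,632 pools, and about 9.0 resolved
positive clones, matching the numbers in the paragraph above.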






[Barillot, et al., 1991] proposed novel pooling designs for a library with
the aforementioned parameters. One of these designs assigned each clone to
a lattice point in a cubic, integer lattice with 43 points on each axis. Each
pool would contain all of the clones sharing a given value of one coordinate;
thus 3 x 43 = 129 pools would result. To gain more information about the
positive clones, a linear transformation was used to obtain a new configuration
of the clones on the lattice points and, thus, another set of 129 pools,
specified by the new coordinates. Thus, each cubic configuration yields three
groups of 43 pools with the property that each clone occurs in one pool from
each group. In general, this property of the Barillot et al. designs
distinguishes them from the designs we propose.

Computer simulation was used for the 258-pool cubic design to determine
that the expected number of resolved positive clones is nearly 8.8. Also, the
expected number of unresolved negative clones is approximately 13.3.
(We assumed that 72,000 of the lattice sites were initially selected uniformly
at random for the first configuration.)



For the same library, using the results of Section 2.4, the optimum choice
of k for a random k-sets design on 258 pools is 11. In this case the expected
number of resolved positive clones is 9.1 and the expected number of
unresolved negative clones is 5.3. However, if we choose k = 6 to achieve pools
comparable in size with the "cubic" design, then the expected number of
resolved positive clones is 7.3, and the expected number of unresolved negative
clones is 12.6. The k-sets packing designs, described in Section 2.3, would
yield larger expected numbers of resolved positive clones.

3.3 Screening results 

The following theoretical and computational results bear on the predicted
performance of our four-sets packing design. As above, this library is
assumed to have a binomial number of positive clones with expectation 2.5.
Computer simulation was used to estimate the expected number of resolved
positive clones for the four-sets packing design at approximately 1.47, versus
1.36, from (^, for random four-sets designs. Similarly, the expected number
of unresolved negative clones for the four-sets packing design is 3.98, versus
4.68, from (^, for a random four-sets design. Thus, to identify all the
positives would require confirmatory testing of 6.48 clones, on the average.

The four-sets packing design was implemented for the 1,298-clone,
human chromosome 16, YAC library. We observed numerous false negative
and also false positive pool assays, precluding the identification of clones on
the basis of being either resolved positive or unresolved positive or negative.
Twenty-two STSs were screened against the pools to achieve closure of the
chromosome 16 framework map. We ranked the clones according to the
probability of being positive, based on the pooling results, as described in
Section 2.5. We set the error rates $\lambda_{+|-}$ and $\lambda_{-|+}$ equal to
0.14 and 0.06, respectively. These rates are primarily based on comparing the
frequency of positive pools with that predicted using the coverage. After
performing confirmatory testing on the top eleven clones (on average), an
average of 1.8 positive clones were identified. Six of the twenty-two STSs
yielded no positive clones.

4 Discussion 

Improved methods for screening clone libraries will allow more efficient use of
currently available biological resources. We propose using k-sets designs for
unique-sequence screening of large clone libraries. These designs are efficient,
flexible, easy to specify and can allow screening in one pass, thus
minimizing human intervention. When possible, we also advocate using the k-sets
packing designs, discussed in Section 2.3, which can yield further substantial
gains in efficiency. The automated implementation of a four-sets packing
design for a YAC library containing 1,298 clones, over a period of 12 hours,
demonstrates the utility of commercially-available robots.

In addition to the expected number of unresolved negative clones, pro- 
posed by Barillot et al., we propose a new design performance criterion: the 
expected number of resolved positive clones. Optimizing a pooling design 
according to either criterion will be sensitive to the upper tail of the dis- 
tribution for the number of positive clones, which we have assumed to be 
binomial for design comparison. For some applications, the determination of
average behavior might not be adequate; thus, it may be useful, for example, 
to estimate the probability that j positives are resolved positive for a pooling 
design, given that there are p positives. In any case, pooling designs which 
have a high probability of achieving the screening objective are useful, even if 
there is no guaranteed performance for a particular STS. On the other hand, 
to guarantee that the status of all of the clones is resolved would require 
many more pools than given in Figure 4a. 

We computed the smallest number of pools required for a random k-sets 
design to achieve a given expected number of resolved positive clones. Figures 
3a and 4a depict the number of pools required for libraries of coverage c with 
n clones, and Figures 3b and 4b depict the optimum values of k. When 
implementing any pooling design, it could be informative to compare the 
total numbers of pools and the number of pools containing each clone to those 
plotted in Figures 3 and 4 because, after optimization, these designs come 
close to achieving optimal performance. One could use the Bayes ranking in 
several ways to find the best parameters for a random k-sets design in the 
presence of experimental error. For the time being, we selected a design from 
Figure 3 and simulated its performance in the presence of a realistic level of 
experimental error. 

Data on pool-assay errors are clearly important in the design of pooling
experiments. Pools with a smaller proportion and concentration of target
DNA might be less likely to yield the correct PCR product. Also, some
primers could be more prone than others to fail to produce products.
False-positive results could arise from cross-contamination. It is not clear
that our simple model of experimental error, with independent probabilities
for false-negative and for false-positive pool assays, is adequate. We will
assess the fidelity of our pooling and screening experiments while assaying the
four-sets packing pools for our physical map of human chromosome 16.
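
The simple error model referred to above, in which each pool assay errs independently, can be sketched as follows. The function name `noisy_assay` and its parameter names are hypothetical, and the rates in the usage note are illustrative values rather than estimates.

```python
import random

def noisy_assay(true_pool_status, false_pos_rate, false_neg_rate, rng):
    """Apply independent assay errors to a list of true pool outcomes.

    true_pool_status: list of bools, True if the pool contains a positive clone.
    false_pos_rate:   probability that a truly negative pool assays positive.
    false_neg_rate:   probability that a truly positive pool assays negative.
    """
    observed = []
    for is_positive in true_pool_status:
        if is_positive:
            # A positive pool is reported positive unless a false negative occurs.
            observed.append(rng.random() >= false_neg_rate)
        else:
            # A negative pool is reported positive only by a false positive.
            observed.append(rng.random() < false_pos_rate)
    return observed
```

For instance, `noisy_assay([True, False, True], 0.1, 0.05, random.Random(0))` returns one simulated set of observed pool results. Correlated errors, such as a primer pair that fails in every pool, are outside this model.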

Experimental error blurs the distinctions between clones in our four
categories and motivates both the consideration of error-correcting pooling
designs [Balding and Torney, in press] and effective ranking of the
candidate-positive clones [Bruno et al., 1994]. In general, an automated
ranking algorithm will also propose a set of clones for confirmatory testing.
It could propose several candidate clones and use the results from these
confirmatory tests when proposing further candidate clones. One criterion for
evaluating ranking algorithms could be the average number of proposed clones
required to identify i of the positive clones or all of them, in the event that
there are fewer than i positives. Or it could be the sum of the probabilities
of being positive assigned to positive clones in computer simulations. One can
optimize the design of the pooling experiments, given the selection of a
ranking technique. In addition, the relative costs of confirmatory testing and
of missing positive clones could be employed as part of design optimization.

It may often be possible to generate designs with better performance than
random k-sets designs. Such designs could include k-sets packings, in which
there is a bound, t, on the number of pools in which two clones coincide.
By varying the value of t, k-sets packing designs can provide an extremely
powerful and flexible approach to library screening. When t = k, we have
random k-sets designs, which are efficient and very easy to construct. As t
is decreased, we gain even greater efficiency at the cost of additional
computations in design generation. At the other extreme we have maximum-size
k-sets packings which are, we believe, maximally efficient but often difficult
to construct. Further, we have shown [Balding and Torney, in press] that
packings which achieve the maximum possible size are best possible in some
cases, in terms of maximizing the probability of a one-pass solution. Such
designs can also be optimal subject to guaranteed error-detection
requirements. The magnitude of the improvement one can achieve by constraining
the intersections is exemplified by the predicted performance of the cubic
row-column pooling design [Barillot et al., 1991]. Thus, we are exploring
combinatoric optimization techniques for the construction of k-sets packing
designs. Some preliminary methods were used for optimizing the pooling
design for the Cla I YAC library described earlier. We plan to improve
these by applying techniques such as the method of conditional expectations
to de-randomize generation of the designs [Alon and Spencer, 1992].
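
One simple way to obtain a k-sets packing with intersection bound t is random generation with rejection, sketched below. This is an illustrative assumption, not the preliminary optimization method mentioned above, and the names `greedy_k_sets_packing` and `max_overlap` are hypothetical.

```python
import random
from itertools import combinations

def greedy_k_sets_packing(n, v, k, t, rng, max_tries=100000):
    """Draw random k-subsets of the v pools, keeping a candidate only if it
    shares at most t pools with every k-set already accepted."""
    design = []
    tries = 0
    while len(design) < n:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("could not place all clones; increase v or t")
        candidate = frozenset(rng.sample(range(v), k))
        if all(len(candidate & other) <= t for other in design):
            design.append(candidate)
    return design

def max_overlap(design):
    """Largest number of pools shared by any two clones in the design."""
    return max((len(a & b) for a, b in combinations(design, 2)), default=0)
```

With t = k the test never rejects, so the result is an ordinary random k-sets design; with t = k - 1 no two clones occupy precisely the same pools. Smaller values of t make rejection more frequent, which reflects the additional computation noted above.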



In summary, although our preliminary results should prove useful, much
exploratory work remains, based upon a better understanding of the prevalent
experimental errors, to achieve superior pooling designs, further reducing
the labor and increasing the efficiency of large-scale, library-screening
experiments. Furthermore, pooling the clones from a pre-existing map will
involve new challenges because of the prior knowledge about the joint
probability distribution for positive clones.

Figure Legends

Figure 1a

Title: "Incidence Matrix for a Pooling Design" 

A binary incidence matrix for a particular pooling scheme with six pools 
and five clones. A clone occurs in a pool if the corresponding matrix element
equals unity and does not occur in a pool if the corresponding matrix element 
equals zero. Pool number 1 contains clones number 1, 3, and 4, pool number 
4 contains only clone number 5, et cetera. 

Figure 1b

Title: "Terminology and Example of Categorization" 

The binary incidence matrix of Figure 1a is depicted, but now clones 1
and 2 are taken to be positive clones and clones 3, 4 and 5 are taken to be
negative. The '1's for the positive clones are replaced by '+'s, the '1's for the
negative clones are replaced by '-'s, and the '0's are omitted. In the absence
of experimental error, all the pools containing either of the positive clones
would be positive; thus, only pool 4 would be negative, as depicted above the
horizontal line opposite 'Pool Assay'. This hypothetical pooling experiment
would resolve the status of clones 1 and 5, and the remaining clones could be
either positive or negative. Clone 5 is a negative clone occurring in pool 4,
which contains no positive clone; thus, it is a resolved negative clone. Clones
3 and 4 are negative clones occurring only in pools containing positive clones;
thus, they are both unresolved negative clones. Because clone 1 is a positive
clone occurring only in pool 6, which does not contain either another positive
clone or an unresolved negative clone, it is a resolved positive clone. Clone
2 is a positive clone occurring only in pools containing either other positive
clones or unresolved negative clones; thus, it is unresolved positive.
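
The four categories can be computed mechanically from a design and a set of positive clones, as in the sketch below. The pool contents used in the usage note are a hypothetical six-pool, five-clone design chosen to reproduce the same categories; they are not a transcription of the matrix in Figure 1a, and error-free assays are assumed.

```python
def categorize(pools, positives, n_clones):
    """Classify clones 1..n_clones from error-free pool assays.

    pools:     list of sets of clone numbers (one set per pool).
    positives: set of clone numbers that are truly positive.
    """
    # Without experimental error, a pool is positive iff it holds a positive clone.
    assay = [bool(pool & positives) for pool in pools]
    # Any clone occurring in a negative pool is a resolved negative.
    resolved_neg = set()
    for pool, is_positive in zip(pools, assay):
        if not is_positive:
            resolved_neg |= pool
    # A clone is resolved positive if some positive pool contains it and
    # otherwise only resolved negatives.
    resolved_pos = set()
    for pool, is_positive in zip(pools, assay):
        remaining = pool - resolved_neg
        if is_positive and len(remaining) == 1:
            resolved_pos |= remaining
    status = {}
    for clone in range(1, n_clones + 1):
        if clone in resolved_neg:
            status[clone] = "resolved negative"
        elif clone in resolved_pos:
            status[clone] = "resolved positive"
        elif clone in positives:
            status[clone] = "unresolved positive"
        else:
            status[clone] = "unresolved negative"
    return status

# Hypothetical design: pools 1..6 listed in order.
example_pools = [{1, 3, 4}, {2, 3}, {2, 4}, {5}, {1, 2}, {1, 5}]
example_status = categorize(example_pools, positives={1, 2}, n_clones=5)
```

In this example only pool 4 is negative, clone 5 is a resolved negative, clone 1 is resolved positive through pool 6, and clones 2, 3 and 4 remain unresolved.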

Figure 2 

Title: "Row-and-Column Pools of the Clones in a 96-well dish" 

A row and column design with 96 clones and 20 pools. Two clones are 
positive (row 3, column 3 and row 5, column 8) and hence the four indicated 
pools would be positive, in the absence of experimental error. There are two 
unresolved negative clones (row 5, column 3 and row 3, column 8) and 92 
resolved negative clones. The two positive clones are unresolved positive. 
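
The counts in this legend can be reproduced with a short calculation, sketched below under the assumption of error-free assays and a dish of 8 rows by 12 columns. The function name `row_column_screen` is hypothetical.

```python
def row_column_screen(n_rows, n_cols, positives):
    """Summarize a row-and-column pooling of an n_rows x n_cols dish.

    positives: set of (row, column) positions of the positive clones.
    """
    positive_rows = {r for r, _ in positives}
    positive_cols = {c for _, c in positives}
    # A clone is a candidate only if both its row pool and column pool are positive.
    candidates = {(r, c) for r in positive_rows for c in positive_cols}
    unresolved_negatives = candidates - set(positives)
    # A positive is resolved if it is the sole candidate in its row or column pool.
    resolved_positives = {
        pos for pos in positives
        if len(positive_cols) == 1 or len(positive_rows) == 1
    }
    return {
        "pools": n_rows + n_cols,
        "positive_pools": len(positive_rows) + len(positive_cols),
        "unresolved_negatives": unresolved_negatives,
        "resolved_negative_count": n_rows * n_cols - len(candidates),
        "resolved_positives": resolved_positives,
    }

figure2 = row_column_screen(8, 12, {(3, 3), (5, 8)})
```

Here `figure2` reports 20 pools, four positive pools, the two unresolved negatives at (3, 8) and (5, 3), 92 resolved negatives, and no resolved positives.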
Figure 3a

Title for Figure 3a: "Minimum Number of Pools with the Expected Number 
of Resolved Positives Equal 0.5c" 

The x axis is n, the number of clones to be pooled; 1000 < n < 100,000,
and the y axis is c, the coverage parameter; 1/4 < c < 16. Both axes have a
logarithmic scale, and the tics facing the inside of the plot are at integer values
while those on the outside are at values corresponding to half-integers. The
smallest value of v such that a random k-sets design can achieve the target
expected number of 0.5c resolved positive clones is depicted. Equation (H) 
was used to generate the data for this plot. 

Figure 3b 

Title for Figure 3b: "Optimum k, Expected Number of Resolved Positives 
Equal 0.5c" 

The x axis is n, the number of clones to be pooled; 1000 < n < 100,000,
and the y axis is c, the coverage parameter; 1/4 < c < 16. Both axes have 
a logarithmic scale, and the tics facing the inside of the plot are at integer 
values while those on the outside are at values corresponding to half-integers. 
The values of k achieving the maximum expected number of resolved positive 
clones are depicted, with the value of v depicted in Figure 3a. These expected 
values slightly exceed 0.5c. For this domain of c and n, optimal values of k 
fall between 6 and 12. Equation was used to generate the data for this 
plot.

Figure 4a 

Title for Figure 4a: "Minimum Number of Pools with the Expected Number 
of Resolved Positives Equal 0.95c" 

The x axis is n, the number of clones to be pooled; 1000 < n < 100,000,
and the y axis is c, the coverage parameter; 1/4 < c < 16. Both axes have a
logarithmic scale, and the tics facing the inside of the plot are at integer values
while those on the outside are at values corresponding to half-integers. The
smallest value of v such that a random k-sets design can achieve the target
expected number of 0.95c resolved positive clones is depicted. Equation (|^) 
was used to generate the data for this plot. 

Figure 4b 

Title for Figure 4b: "Optimum k; Expected Number of Resolved Positive
Clones Equal to 0.95c" 

The x axis is n, the number of clones to be pooled; 1000 < n < 100,000,
and the y axis is c, the coverage parameter; 1/4 < c < 16. Both axes have 
a logarithmic scale, and the tics facing the inside of the plot are at integer 
values while those on the outside are at values corresponding to half-integers. 
The values of k achieving the maximum expected number of resolved positive 
clones are depicted, with the value of v depicted in Figure 4a. These expected 
values slightly exceed 0.95c. For this domain of c and n, optimal values of k 
fall between 7 and 12. Equation was used to generate the data for this
plot. 



A Appendix: Derivations 

The inner summation in the main text, which equals the probability,
$K_k^{(p)}$, that each of k specified pools contains one or more positive
clones when there are exactly p positive clones, can be evaluated via the
recursive formula

\[
K_j^{(p)} = \sum_{i=0}^{j} \binom{j}{i} \binom{v-j}{k-j+i} \binom{v}{k}^{-1} K_i^{(p-1)} \tag{8}
\]

for $j \le k$, where $K_j^{(0)} = 1$ if $j = 0$, otherwise $K_j^{(0)} = 0$.
Equation (8) is numerically advantageous because it involves no subtractions.
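
A subtraction-free recursion of this kind can be checked against the inclusion-exclusion sum in exact arithmetic. The sketch below uses one form of the recursion, obtained by conditioning on the pools occupied by the last positive clone; this form and the function names are assumptions made for illustration, derived from the definitions of $K_j^{(p)}$ and $Z_i$ rather than copied from the original typeset equation.

```python
from fractions import Fraction
from math import comb

def Z(i, v, k):
    """Probability that a given clone occurs in none of i specified pools."""
    return Fraction(comb(v - i, k), comb(v, k))

def K_inclusion_exclusion(j, p, v, k):
    """P(each of j specified pools holds at least one of p positive clones)."""
    return sum((-1) ** i * comb(j, i) * Z(i, v, k) ** p for i in range(j + 1))

def K_recursive(p, v, k):
    """Return [K_0, ..., K_k] for p positive clones using only additions.

    Conditioning on the last positive clone: it misses exactly i of the j
    specified pools with probability C(j, i) C(v-j, k-j+i) / C(v, k), and the
    remaining p-1 clones must then cover those i pools.
    """
    K = [Fraction(1)] + [Fraction(0)] * k  # p = 0: only the empty set is covered
    for _ in range(p):
        K = [
            sum(Fraction(comb(j, i) * comb(v - j, k - j + i), comb(v, k)) * K[i]
                for i in range(j + 1))
            for j in range(k + 1)
        ]
    return K
```

Every term in `K_recursive` is non-negative, so no cancellation occurs; in floating point this avoids the loss of precision that alternating sums can suffer when p is large.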

We turn now to the expected number of unresolved positive clones, P.
Let $\alpha^{(p)}$ denote the probability that a selected positive clone is
unresolved, given that there are exactly p positive clones. When p = 1, the
positive clone will be unresolved only if a negative clone occurs in precisely
the same k pools, so that

\[
\alpha^{(1)} = 1 - \left(1 - \binom{v}{k}^{-1}\right)^{n-1}. \tag{9}
\]

(In practice, a k-sets design would usually be generated so that no two clones
occupy precisely the same pools and hence $\alpha^{(1)} = 0$, as described in
Section 2.3. Here, however, it is convenient to consider standard random k-sets
designs.)
For $p \ge 2$, we determine $\alpha^{(p)}$ by conditioning on the values of three
random variables U, X, and Y, where U is the number of unresolved negative
clones, X is the number of pools which would be positive if the selected
positive clone were removed, and Y is the number of positive pools. Thus,
Y-X is the number of pools containing the selected positive clone but no
other positive clone. The selected positive clone will be unresolved positive
either if Y-X is zero or if each of these Y-X pools contains at least one
unresolved negative clone. Thus,

\[
\alpha^{(p)} = \sum_{u=0}^{n-p} \sum_{x=k}^{v} \sum_{y=x}^{v}
\Pr(A \mid U=u, X=x, Y=y)\, \Pr(U=u, X=x, Y=y), \tag{10}
\]

where A denotes the event that every pool in which the selected positive
clone occurs contains either a positive clone or an unresolved negative clone.
There is an implicit conditioning on the number p of positive clones in each
term of (10).



To evaluate (10), we use the equality

\[
\Pr(U=u, X=x, Y=y) = \Pr(U=u \mid X=x, Y=y)\, \Pr(Y=y \mid X=x)\, \Pr(X=x).
\]

Now

\[
\Pr(X=x) = \binom{v}{x} L_{v-x}^{(p-1)},
\]

where $L_j^{(p)}$ denotes the probability that j specified pools are precisely
the negative pools when there are exactly p positive clones. By the
inclusion-exclusion principle,

\[
L_j^{(p)} = \sum_{i=j}^{v} \binom{v-j}{i-j} (-1)^{i-j} Z_i^{p}, \tag{11}
\]

in which $Z_i$ is the probability, introduced earlier, that a given clone
occurs in none of i specified pools. Therefore

\[
\Pr(X=x) = \binom{v}{x} \sum_{i=v-x}^{v} \binom{x}{i-v+x} (-1)^{i-v+x} Z_i^{p-1}. \tag{12}
\]

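As was the case for $K_j^{(p)}$, a recursive formula for $L_j^{(p)}$ is also
available:

\[
L_j^{(p)} = \sum_{i=j}^{\min(v,\, j+k)} \binom{v-j}{i-j} \binom{v-i}{k-i+j} \binom{v}{k}^{-1} L_i^{(p-1)}, \tag{13}
\]

in which $L_i^{(0)} = 1$ if $i = v$, otherwise $L_i^{(0)} = 0$. Parenthetically,
the inclusion-exclusion principle can be used to derive the $K_j^{(p)}$ from
the $L_j^{(p)}$ and vice versa.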
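
The agreement between a recursion for $L_j^{(p)}$ and the inclusion-exclusion sum (11) can be confirmed numerically. In the sketch below the recursion is obtained by conditioning on the last positive clone, which must occupy exactly the pools that turn a larger negative set into the specified one; this form and the function names are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb

def Z(i, v, k):
    """Probability that a given clone occurs in none of i specified pools."""
    return Fraction(comb(v - i, k), comb(v, k))

def L_inclusion_exclusion(j, p, v, k):
    """P(j specified pools are precisely the negative pools), p positives."""
    return sum((-1) ** (i - j) * comb(v - j, i - j) * Z(i, v, k) ** p
               for i in range(j, v + 1))

def L_recursive(p, v, k):
    """Return [L_0, ..., L_v] for p positive clones using only additions.

    Before the last clone is added, the negative pools form a set of size i
    containing the j specified pools. The last clone must occupy the i-j extra
    pools plus k-i+j pools that are already positive.
    """
    L = [Fraction(0)] * v + [Fraction(1)]  # p = 0: all v pools are negative
    for _ in range(p):
        L = [
            sum(Fraction(comb(v - j, i - j) * comb(v - i, k - i + j), comb(v, k)) * L[i]
                for i in range(j, min(v, j + k) + 1))
            for j in range(v + 1)
        ]
    return L
```

Because the sets of negative pools partition the sample space, the values also satisfy the normalization that the sum over j of C(v, j) times $L_j^{(p)}$ equals one, which gives a quick sanity check.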



Given X=x, the distribution of Y is hypergeometric:

\[
\Pr(Y=y \mid X=x) = \binom{v-x}{y-x} \binom{x}{k-y+x} \binom{v}{k}^{-1}. \tag{14}
\]

Further, given Y=y, the distribution of U is conditionally independent of X
and is binomial:

\[
\Pr(U=u \mid Y=y) = B(n-p, u, \beta), \tag{15}
\]

where $B(a,b,t)$ denotes the binomial probabilities (1) and where $\beta$ is the
probability that a given negative clone is unresolved negative, so that

\[
\beta = \binom{y}{k} \binom{v}{k}^{-1}.
\]
Finally, for $u > 0$,

\[
\Pr(A \mid U=u, X=x, Y=y) = \sum_{i=0}^{y-x} \binom{y-x}{i} (-1)^{i} W_i^{u}, \tag{16}
\]

in which

\[
W_j = \binom{y-j}{k} \binom{y}{k}^{-1}.
\]

The RHS of (16) is, essentially, $K_{y-x}^{(u)}$ with v replaced by y, noting
that each of the possible $\binom{y}{k}$ k-element subsets of the y positive
pools has equal probability of being the pools which contain a particular
unresolved negative clone.

The main-text expression for P follows from (9) and (10), together with (12),
(14), (15) and (16):

\[
P = \sum_{p=1}^{n} p\, B(n, p, c/n)\, \alpha^{(p)}, \tag{17}
\]

where $B(a,b,t)$ denotes the binomial probabilities (1).
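
For small designs the whole calculation can be carried out in exact arithmetic, as sketched below. The hypergeometric form used for Pr(Y = y | X = x), the binomial function `binomial(a, b, t)` standing in for $B(a,b,t)$, and all function names are assumptions made for this illustration; the sketch is meant for checking the structure of the derivation on toy cases, not for production-size libraries.

```python
from fractions import Fraction
from math import comb

def Z(i, v, k):
    """Probability that a given clone occurs in none of i specified pools."""
    return Fraction(comb(v - i, k), comb(v, k))

def binomial(a, b, t):
    """Assumed form of B(a, b, t): probability of b successes in a trials."""
    return comb(a, b) * t ** b * (1 - t) ** (a - b)

def prob_X(x, p, v, k):
    """Pr(X = x): the other p-1 positives make exactly x pools positive."""
    j = v - x
    return comb(v, x) * sum((-1) ** (i - j) * comb(x, i - j) * Z(i, v, k) ** (p - 1)
                            for i in range(j, v + 1))

def prob_Y_given_X(y, x, v, k):
    """Selected positive adds y-x new pools and reuses k-(y-x) positive pools."""
    return Fraction(comb(v - x, y - x) * comb(x, k - (y - x)), comb(v, k))

def prob_A(u, x, y, k):
    """Each of the y-x pools holds at least one of u unresolved negatives."""
    return sum((-1) ** i * comb(y - x, i) * Fraction(comb(y - i, k), comb(y, k)) ** u
               for i in range(y - x + 1))

def alpha(p, n, v, k):
    """Probability that a selected positive is unresolved, given p positives."""
    if p == 1:
        return 1 - (1 - Fraction(1, comb(v, k))) ** (n - 1)
    total = Fraction(0)
    for x in range(k, v + 1):
        px = prob_X(x, p, v, k)
        if px == 0:
            continue
        for y in range(x, min(v, x + k) + 1):
            beta = Fraction(comb(y, k), comb(v, k))
            pyx = prob_Y_given_X(y, x, v, k)
            for u in range(n - p + 1):
                total += prob_A(u, x, y, k) * binomial(n - p, u, beta) * pyx * px
    return total

def expected_unresolved_positives(n, c, v, k):
    """Sum over p of p * B(n, p, c/n) * alpha(p)."""
    t = Fraction(c) / n
    return sum(p * binomial(n, p, t) * alpha(p, n, v, k) for p in range(1, n + 1))
```

As a hand check, with n = 3 clones, v = 4 pools and k = 2, enumerating the three possible overlaps between the two positive clones gives a probability of 5/12 that the selected positive is unresolved when p = 2, and `alpha(2, 3, 4, 2)` returns the same value.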